TB case reports (Solutions to Exercises)¶

Data sets from http://www.who.int/tb/country/data/download/en/

In [36]:

library(tidyverse)
library(stringr)
library(forcats)

Exercise (Messy to tidy)¶

We start with a fairly messy version of the data set in tidyr::who.

In [37]:

who  <- tidyr::who

In [38]:

head(who)

country	iso2	iso3	year	new_sp_m014	new_sp_m1524	new_sp_m2534	new_sp_m3544	new_sp_m4554	new_sp_m5564	⋯	newrel_m4554	newrel_m5564	newrel_m65	newrel_f014	newrel_f1524	newrel_f2534	newrel_f3544	newrel_f4554	newrel_f5564	newrel_f65
Afghanistan	AF	AFG	1980	NA	NA	NA	NA	NA	NA	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1981	NA	NA	NA	NA	NA	NA	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1982	NA	NA	NA	NA	NA	NA	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1983	NA	NA	NA	NA	NA	NA	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1984	NA	NA	NA	NA	NA	NA	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1985	NA	NA	NA	NA	NA	NA	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Ex 1*.

Find the definition of the first 6 column names by doing a semi_join with dict.

In [40]:

labels  <- data.frame(name = colnames(who)[1:6])
labels

name
country
iso2
iso3
year
new_sp_m014
new_sp_m1524

In [42]:

dict_url <- "https://extranet.who.int/tme/generateCSV.asp?ds=dictionary"
if (!file.exists("dict.csv")) download.file(dict_url, "dict.csv")
dict <- read_csv('dict.csv')

Parsed with column specification:
cols(
  variable_name = col_character(),
  dataset = col_character(),
  code_list = col_character(),
  definition = col_character()
)

In [43]:

semi_join(dict, labels, by=c("variable_name" = "name"))

Warning message:
“Column `variable_name`/`name` joining character vector and factor, coercing into character vector”

variable_name	dataset	code_list	definition
country	Country identification	NA	Country or territory name
iso2	Country identification	NA	ISO 2-character country/territory code
iso3	Country identification	NA	ISO 3-character country/territory code
new_sp_m014	Notification	NA	New pulmonary smear positive cases: males aged 0-14 years (not used after 2012)
new_sp_m1524	Notification	NA	New pulmonary smear positive cases: males aged 15-24 years (not used after 2012)

Ex 2.

Create a new column called notification and gather all columns beginning wiith new there, such that their values are in a new column called cases. Remove NA values when using gather.

In [44]:

who1  <-  who %>% gather(starts_with('new'),
                         key = 'notification',
                         value = 'cases',
                         na.rm = TRUE)

In [45]:

head(who1)

country	iso2	iso3	year	notification	cases
Afghanistan	AF	AFG	1997	new_sp_m014	0
Afghanistan	AF	AFG	1998	new_sp_m014	30
Afghanistan	AF	AFG	1999	new_sp_m014	8
Afghanistan	AF	AFG	2000	new_sp_m014	52
Afghanistan	AF	AFG	2001	new_sp_m014	129
Afghanistan	AF	AFG	2002	new_sp_m014	90

Ex 3.

Replace all occurrences of the string newrel with new_rel

In [46]:

who2  <- who1 %>% mutate(notification = str_replace(notification, "newrel", "new_rel"))

In [47]:

head(who2)

country	iso2	iso3	year	notification	cases
Afghanistan	AF	AFG	1997	new_sp_m014	0
Afghanistan	AF	AFG	1998	new_sp_m014	30
Afghanistan	AF	AFG	1999	new_sp_m014	8
Afghanistan	AF	AFG	2000	new_sp_m014	52
Afghanistan	AF	AFG	2001	new_sp_m014	129
Afghanistan	AF	AFG	2002	new_sp_m014	90

Ex 4.

Split the notification column into new, type, sex, age. (Note: There are no sex unknown data here)

In [48]:

who3  <- who2 %>% separate(notification, into = c("new", "type", "sexage"))

In [49]:

head(who3)

country	iso2	iso3	year	new	type	sexage	cases
Afghanistan	AF	AFG	1997	new	sp	m014	0
Afghanistan	AF	AFG	1998	new	sp	m014	30
Afghanistan	AF	AFG	1999	new	sp	m014	8
Afghanistan	AF	AFG	2000	new	sp	m014	52
Afghanistan	AF	AFG	2001	new	sp	m014	129
Afghanistan	AF	AFG	2002	new	sp	m014	90

In [50]:

who4  <- who3 %>% separate(sexage, into = c("sex", "age"), sep=1)

In [51]:

head(who4)

country	iso2	iso3	year	new	type	sex	age	cases
Afghanistan	AF	AFG	1997	new	sp	m	014	0
Afghanistan	AF	AFG	1998	new	sp	m	014	30
Afghanistan	AF	AFG	1999	new	sp	m	014	8
Afghanistan	AF	AFG	2000	new	sp	m	014	52
Afghanistan	AF	AFG	2001	new	sp	m	014	129
Afghanistan	AF	AFG	2002	new	sp	m	014	90

Ex 5.

Drop the iso2, iso3, new columns.

In [52]:

who5  <-  who4 %>% select(-iso2, -iso3, -new)

In [53]:

head(who5)

country	year	type	sex	age	cases
Afghanistan	1997	sp	m	014	0
Afghanistan	1998	sp	m	014	30
Afghanistan	1999	sp	m	014	8
Afghanistan	2000	sp	m	014	52
Afghanistan	2001	sp	m	014	129
Afghanistan	2002	sp	m	014	90

Exercise (Exploratory data analysis)¶

Ex 1.

For each country compute the total number of cases of TB and save as ex1

In [54]:

ex1 <- who5 %>% group_by(country) %>% summarize(count = sum(cases))

In [55]:

head(ex1)

country	count
Afghanistan	140225
Albania	5335
Algeria	128119
American Samoa	41
Andorra	103
Angola	308365

Ex 2.

For each year, find the country that had the smear positive pulmonary TB. Display in chronological order.

In [56]:

who5 %>% filter(type == 'sp') %>%
group_by(year) %>%
filter(cases == max(cases)) %>%
select(year, country, cases) %>%
arrange(year)

year	country	cases
1980	Canada	186
1981	Canada	141
1982	Canada	150
1983	Canada	123
1984	Canada	169
1985	Canada	168
1986	Canada	147
1987	Canada	129
1988	Canada	131
1989	Canada	122
1990	Canada	100
1991	Canada	110
1992	Canada	79
1993	Canada	74
1994	Canada	87
1995	China	18306
1996	China	24057
1997	China	28247
1998	China	30093
1999	China	29328
2000	India	31090
2001	India	30007
2002	India	55829
2003	India	63587
2004	India	74450
2005	India	76870
2006	India	82939
2007	India	88045
2008	India	90498
2009	India	90830
2010	India	90440
2011	India	89706
2012	India	88111

3. Express the data for China as a data frame with just 3 columns - year, f and m where the values for f and m are the sum of female and male cases in that year. Show just the 5 years with highest total count of cases.

In [57]:

who5 %>%
filter(country == 'China') %>%
group_by(year, sex) %>%
summarize(count = sum(cases)) %>%
spread(key = sex, value = count) %>%
arrange(desc(f + m)) %>%
head(5)

year	f	m
2009	270530	613947
2010	268998	600094
2011	263933	601126
2012	265110	593751
2013	261049	586127

Ex 4.

Make a scatter plot of the number of reports (each row is a report) of TB in each country with the regions on the y axis. Restrict to countries that begin with the letter A. Order the countries by number of reports and remove the y label.

In [58]:

options(repr.plot.width=4, repr.plot.height=3)

In [59]:

foo  <- who5 %>%
filter(str_detect(country, '^A')) %>%
group_by(country) %>%
summarise(n = n())

foo %>%
ggplot(aes(n, reorder(country, n))) +
geom_point() +
ylab(NULL)

Data type cannot be displayed:

_images/CFAR_R_Workshop_2018_Exercisees_33_1.png

Ex 5.

Display ex1 as a bar chart. Because of the large number of countries, place the countries on the y-axis, and display countries from fewest to most cases. Scale the counts using a log10 scale, adding 1 to each count so we don’t have to deal with negative infinity.

In [60]:

options(repr.plot.width=6, repr.plot.height=30)

In [61]:

ggplot(ex1) +
geom_bar(aes(x=reorder(country, -count), y=1+count), stat='identity') +
xlab(NULL) +
ylab("Log10(TB cases)") +
scale_y_log10() +
coord_flip()

Data type cannot be displayed:

_images/CFAR_R_Workshop_2018_Exercisees_36_1.png

Ex 6.

- Make a bar chart of the number of reports (each row is a report) of TB in each country that begin with the letter A. Order the countries by number of reports increasing from left to right. Rotate the country labels so they are readable.

In [62]:

options(repr.plot.width=4, repr.plot.height=3)

In [63]:

who5 %>%
filter(str_detect(country, '^A')) %>%
ggplot() +
geom_bar(aes(x = country %>% fct_infreq %>% fct_rev)) +
theme(axis.text.x = element_text(angle = 90)) +
xlab(NULL)

Data type cannot be displayed:

_images/CFAR_R_Workshop_2018_Exercisees_39_1.png

Ex 7.

Can you display as proportion instead of count?

In [64]:

who5 %>%
filter(str_detect(country, '^A')) %>%
ggplot() +
geom_bar(aes(x = country %>% fct_infreq %>% fct_rev,
             y = ..prop..,
             group = 1)) +
theme(axis.text.x = element_text(angle = 90)) +
xlab(NULL)

Data type cannot be displayed:

_images/CFAR_R_Workshop_2018_Exercisees_41_1.png

Ex 8.

Add a label showing the first 3 letters of the country label on top of each bar, and remove the x tick and axis labels. Make the bars 50% transparent and fill by prop using a color gradient from yellow to red.

In [65]:

ex8  <- who5 %>%
filter(str_detect(country, '^A')) %>%
mutate(country = country %>% str_sub(1, 3) %>% str_to_upper) %>%
group_by(country) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))

In [66]:

ex8

country	n	prop
AFG	244	0.056286044
ALB	448	0.103344867
ALG	224	0.051672434
AME	172	0.039677047
AND	387	0.089273356
ANG	425	0.098039216
ANT	346	0.079815456
ARG	448	0.103344867
ARM	461	0.106343714
ARU	37	0.008535179
AUS	826	0.190542099
AZE	317	0.073125721

In [67]:

options(repr.plot.width=7, repr.plot.height=3)

In [69]:

ex8 %>%
ggplot(aes(x = reorder(country, prop),
           y = prop,
           fill  = prop))  +
geom_bar(stat = 'identity', alpha = 0.5) +
geom_text(label = reorder(ex8$country, ex8$prop),
          nudge_y = .02) +
xlab(NULL) +
theme(axis.text.x = element_blank(),
      axis.ticks.x = element_blank()) +
scale_fill_gradient(low = "yellow", high="red") +
guides(fill = FALSE)

Data type cannot be displayed:

_images/CFAR_R_Workshop_2018_Exercisees_46_1.png

Ex 9.

Make a bar chart of counts for male and female side by side with countries starting with A on the y axis
Set the fill of the bar chart to pink for f and blue for m
Add a meaningful title

In [70]:

options(repr.plot.width=6, repr.plot.height=6)

In [71]:

who5 %>%
filter(str_detect(country, '^A')) %>%
group_by(country, sex) %>%
mutate(count = sum(cases)) %>%
ggplot +
geom_bar(aes(x=reorder(country, count),
             y=count, fill=sex),
         position='dodge',
         stat='identity') +
scale_fill_manual(values = c("f" = "pink", "m" = "blue")) +
labs(title="TB cases by region and sex") +
coord_flip()

Data type cannot be displayed:

_images/CFAR_R_Workshop_2018_Exercisees_49_1.png

Ex 10.

Make a grid of plots where each column is a different sex), each row is a different country starting with A), and each subplot is a plot of trend over time. Allow the y - scales to vary across plots.Make the fitted curves for each sex have a different color.

In [72]:

options(repr.plot.width=6, repr.plot.height=15)

In [35]:

g  <- who5 %>% filter(str_detect(country, '^A')) %>%
ggplot(aes(x = year, y=cases, color=sex)) +
geom_smooth() +
facet_grid(country ~ sex, scales="free")
suppressWarnings(print(g))

`geom_smooth()` using method = 'loess' and formula 'y ~ x'

_images/CFAR_R_Workshop_2018_Exercisees_52_1.png

In [ ]:

TB case reports (Solutions to Exercises)¶

Exercise (Messy to tidy)¶

Exercise (Exploratory data analysis)¶

Page contents

This Page